PREDICTION OF NEXT WEEK'S COVID 19 DEATHS BASED ON THIS WEEK'S DATA¶
SACKO Kalil, Master's student at the University of Applied Sciences Bochum, major: Computer Science¶
Introduction¶
This project focuses on predicting COVID-19-related deaths for the upcoming week based on data from the current week. It is a seminar project in the field of Big Data, inspired by a Kaggle competition. The competition that serves as the foundation for this project can be found at the following link: https://www.kaggle.com/competitions/Covid19-Death-Predictions/overview.
The goal of the project is to make accurate predictions while gaining significant insights from the data. Modern analytical methods and machine learning techniques are employed to address the challenges posed by this real-world scenario.
-------------------------------APPROACH-----------------------------¶
I. EXPLORATORY DATA ANALYSIS¶
Objective:¶
To understand the available data as thoroughly as possible in order to define a modeling strategy.
Basic Checklist (not exhaustive):¶
I-I. Basic Analysis (Analysis of the Data Structure)¶
- Target Variable
- Number of Rows and Columns
- Variable Types
- Descriptive Analysis
- Analysis of Missing Values
- Analysis of Outliers
- Analysis of Variable Distributions
I-II. Content Analysis:¶
- Objective:
To examine the relationships between variables and identify potential hypotheses for testing
- Exploration of the Target Variable
- Relationships between Variables and the Target Variable
- Relationships among Independent Variables
- Temporal and Geographical Analysis
- Vaccination and Its Effects
I-III. Hypotheses to be tested:¶
Null Hypotheses (H₀)¶
- Hypothesis 1: Do regions with higher vaccination rates have lower weekly death counts?
- Hypothesis 2: Do regions with higher COVID-19 case numbers have higher death rates?
- Hypothesis 3: Do regions with higher weekly death rates also have higher death rates in the following week?
- etc.
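One possible way to test a hypothesis such as Hypothesis 1 is a rank correlation test. The sketch below is only an illustration: the data are synthetic and deliberately built with a negative relationship, the variable names `vaccination_rate` and `deaths_per_million` are assumptions, and the choice of Spearman's test and of `alpha = 0.05` is not taken from the project itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # fixed seed for reproducibility

# Synthetic stand-ins (NOT the real dataset): higher vaccination rates
# are constructed to go along with lower deaths per million.
vaccination_rate = rng.uniform(0, 100, size=300)
deaths_per_million = 20 - 0.15 * vaccination_rate + rng.normal(0, 3, size=300)

# H0: there is no monotonic association between the two variables.
rho, p_value = stats.spearmanr(vaccination_rate, deaths_per_million)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha

print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3g}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Because the relationship is built into the synthetic data, the outcome says nothing about the real dataset; the sketch only shows the mechanics of the test.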
II. PRE-PROCESSING¶
Objective:¶
To transform the data into a format suitable for machine learning.
Basic Checklist (not exhaustive):¶
- Creation of Training and Validation (pre-test) Datasets
- Encoding
- Handling NaN Values: dropna(), Imputation
- Treatment of Outliers that Negatively Affect the Model
- Feature Selection
- Feature Engineering
- Feature Scaling
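As a minimal sketch of how some of these steps could fit together, the example below splits a toy frame into training and validation parts, imputes NaN values with the training medians, and standardizes with the training statistics. The toy data, the 80/20 split, and the choice of median imputation and standard scaling are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data standing in for two numeric columns of the real dataset.
toy = pd.DataFrame({
    "Weekly Cases": rng.lognormal(5, 1, size=100),
    "Daily Vaccinations": rng.lognormal(8, 1, size=100),
})
# Knock out 30% of one column to imitate missing vaccination data.
nan_rows = toy.sample(frac=0.3, random_state=0).index
toy.loc[nan_rows, "Daily Vaccinations"] = np.nan

# Assumed 80/20 random split into training and validation sets.
train = toy.sample(frac=0.8, random_state=1)
valid = toy.drop(train.index)

# Imputation: medians are computed on the training set only,
# then reused for the validation set to avoid information leakage.
medians = train.median()
train_filled = train.fillna(medians)
valid_filled = valid.fillna(medians)

# Scaling: standardize both sets with the training mean and std.
mean, std = train_filled.mean(), train_filled.std()
train_scaled = (train_filled - mean) / std
valid_scaled = (valid_filled - mean) / std

print(train_scaled.describe().loc[["mean", "std"]])
```

Fitting the imputation and scaling statistics on the training part only keeps the validation set untouched by information it should not see.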
III. MODELLING AND TRAINING¶
Objective:¶
To develop a machine learning model that fulfills the ultimate goal.
Basic Checklist (not exhaustive):¶
- Definition of an Evaluation Function
- Training Various Models
- Learning Curve
- Coefficient of Determination (R² Score)
- Error Analysis and Returning to Preprocessing/EDA (Optional)
- Optimization: Using GridSearchCV and/or RandomizedSearchCV, Applying Ensemble Learners
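As a rough sketch of what an evaluation function from this checklist might look like, the snippet below computes the RMSE and the R² score with NumPy. The function names `rmse` and `r2_score_manual` and the sample values are illustrative assumptions; the metric actually used by the Kaggle competition is not stated here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical weekly death counts and predictions (illustration only).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2_score_manual(y_true, y_pred))
```

An R² close to 1 means the predictions explain most of the variance of the target, while the RMSE expresses the typical error in the unit of the target (deaths per week).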
IV. TEST PHASE¶
- Final test of the selected best model(s) with a new dataset (test set).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 111)
pd.set_option('display.max_columns', 111)
data = pd.read_csv('train.csv')
Number of rows and columns¶
data
| | Id | Location | Weekly Cases | Year | Weekly Cases per Million | Weekly Deaths | Weekly Deaths per Million | Total Vaccinations | People Vaccinated | People Fully Vaccinated | Total Boosters | Daily Vaccinations | Total Vaccinations per Hundred | People Vaccinated per Hundred | People Fully Vaccinated per Hundred | Total Boosters per Hundred | Daily Vaccinations per Hundred | Daily People Vaccinated | Daily People Vaccinated per Hundred | Next Week's Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 911530868 | World | 2372.0 | 2020 | 0.300 | 65.0 | 0.008 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 344.0 |
| 1 | 807936902 | World | 5023.0 | 2020 | 0.635 | 114.0 | 0.014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 361.0 |
| 2 | 773590408 | World | 5612.0 | 2020 | 0.710 | 116.0 | 0.015 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 431.0 |
| 3 | 130466459 | World | 7580.0 | 2020 | 0.958 | 153.0 | 0.019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 463.0 |
| 4 | 544040446 | World | 8983.0 | 2020 | 1.136 | 187.0 | 0.024 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 506.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129151 | 541829605 | Zimbabwe | 464.0 | 2022 | 29.012 | 13.0 | 0.813 | 11949993.0 | 6297324.0 | 4601845.0 | 1050824.0 | 5665.0 | 74.72 | 39.37 | 28.77 | 6.57 | 354.0 | 1427.0 | 0.009 | 9.0 |
| 129152 | 969939474 | Zimbabwe | 471.0 | 2022 | 29.449 | 12.0 | 0.750 | 11958771.0 | 6299348.0 | 4605821.0 | 1053602.0 | 5295.0 | 74.77 | 39.39 | 28.80 | 6.59 | 331.0 | 1362.0 | 0.009 | 7.0 |
| 129153 | 667902340 | Zimbabwe | 450.0 | 2022 | 28.136 | 13.0 | 0.813 | NaN | NaN | NaN | NaN | 5316.0 | NaN | NaN | NaN | NaN | 332.0 | 1483.0 | 0.009 | 5.0 |
| 129154 | 961193720 | Zimbabwe | 277.0 | 2022 | 17.320 | 6.0 | 0.375 | 11974313.0 | 6305470.0 | 4611113.0 | 1057730.0 | 5358.0 | 74.87 | 39.43 | 28.83 | 6.61 | 335.0 | 1633.0 | 0.010 | 7.0 |
| 129155 | 832612563 | Zimbabwe | 277.0 | 2022 | 17.320 | 6.0 | 0.375 | 11984914.0 | 6310089.0 | 4614738.0 | 1060087.0 | 6190.0 | 74.94 | 39.45 | 28.85 | 6.63 | 387.0 | 2102.0 | 0.013 | 8.0 |
129156 rows × 20 columns
print("Number of rows in the dataset : ", len(data))
print("Number of columns in the dataset : ", len(data.columns))
print("THE TARGET VARIABLE IS 'Next Week's Deaths' : ")
Number of rows in the dataset :  129156
Number of columns in the dataset :  20
THE TARGET VARIABLE IS 'Next Week's Deaths' : 
Variable Types¶
#Variable types
data.dtypes
Id                                       int64
Location                                object
Weekly Cases                           float64
Year                                     int64
Weekly Cases per Million               float64
Weekly Deaths                          float64
Weekly Deaths per Million              float64
Total Vaccinations                     float64
People Vaccinated                      float64
People Fully Vaccinated                float64
Total Boosters                         float64
Daily Vaccinations                     float64
Total Vaccinations per Hundred         float64
People Vaccinated per Hundred          float64
People Fully Vaccinated per Hundred    float64
Total Boosters per Hundred             float64
Daily Vaccinations per Hundred         float64
Daily People Vaccinated                float64
Daily People Vaccinated per Hundred    float64
Next Week's Deaths                     float64
dtype: object
print(data.dtypes.value_counts())
data.dtypes.value_counts().plot(kind='pie', legend=True, figsize=(8,5))
float64    17
int64       2
object      1
Name: count, dtype: int64
<Axes: ylabel='count'>
Descriptive Analysis¶
#Descriptive Analysis
data.describe()
| | Id | Weekly Cases | Year | Weekly Cases per Million | Weekly Deaths | Weekly Deaths per Million | Total Vaccinations | People Vaccinated | People Fully Vaccinated | Total Boosters | Daily Vaccinations | Total Vaccinations per Hundred | People Vaccinated per Hundred | People Fully Vaccinated per Hundred | Total Boosters per Hundred | Daily Vaccinations per Hundred | Daily People Vaccinated | Daily People Vaccinated per Hundred | Next Week's Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.291560e+05 | 1.289430e+05 | 129156.000000 | 128288.000000 | 127898.000000 | 127247.000000 | 4.027000e+04 | 3.842900e+04 | 3.657400e+04 | 1.984700e+04 | 7.784000e+04 | 40270.000000 | 38429.000000 | 36574.000000 | 19847.000000 | 77840.000000 | 7.739100e+04 | 77391.000000 | 129156.000000 |
| mean | 5.502597e+08 | 9.520131e+04 | 2020.912919 | 1379.071563 | 1072.815494 | 10.845384 | 2.450093e+08 | 1.174096e+08 | 9.862264e+07 | 4.365458e+07 | 4.742138e+05 | 89.723652 | 43.424276 | 38.421866 | 20.160324 | 2702.988798 | 1.823430e+05 | 0.114669 | 1064.082776 |
| std | 2.599890e+08 | 6.329716e+05 | 0.739667 | 4013.421702 | 5287.848128 | 24.740908 | 1.032824e+09 | 4.932070e+08 | 4.303844e+08 | 1.862997e+08 | 2.593336e+06 | 74.209648 | 30.074617 | 29.197973 | 22.598973 | 3468.942102 | 1.061855e+06 | 0.202150 | 5251.447471 |
| min | 1.000006e+08 | 0.000000e+00 | 2020.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 |
| 25% | 3.253421e+08 | 1.040000e+02 | 2020.000000 | 18.291500 | 1.000000 | 0.050000 | 9.900038e+05 | 5.771390e+05 | 4.683338e+05 | 3.894150e+04 | 1.065000e+03 | 18.730000 | 12.810000 | 8.390000 | 0.295000 | 402.000000 | 3.280000e+02 | 0.012000 | 1.000000 |
| 50% | 5.497285e+08 | 1.084000e+03 | 2021.000000 | 188.584000 | 13.000000 | 1.709000 | 7.772026e+06 | 4.306986e+06 | 3.811594e+06 | 1.397130e+06 | 9.182500e+03 | 79.195000 | 46.330000 | 38.515000 | 9.780000 | 1498.000000 | 3.287000e+03 | 0.046000 | 13.000000 |
| 75% | 7.756598e+08 | 9.750000e+03 | 2021.000000 | 1046.367750 | 127.000000 | 10.934000 | 4.976308e+07 | 2.557442e+07 | 2.168825e+07 | 1.114840e+07 | 6.394475e+04 | 146.290000 | 70.380000 | 64.720000 | 36.860000 | 3840.000000 | 2.318000e+04 | 0.141000 | 125.000000 |
| max | 9.999993e+08 | 2.406618e+07 | 2022.000000 | 104220.239000 | 103568.000000 | 1040.710000 | 1.212140e+10 | 5.255161e+09 | 4.816606e+09 | 2.129044e+09 | 4.368841e+07 | 366.870000 | 128.780000 | 126.790000 | 125.850000 | 117862.000000 | 2.099974e+07 | 11.786000 | 102123.000000 |
Skewness:¶
- Most columns are strongly skewed, which is typical for data whose minimum values are close to zero and whose maximum values are very high (long-tailed distribution). For example:
Weekly Cases, Weekly Deaths, Total Vaccinations, People Vaccinated, etc., exhibit a large difference between the median (50%) and the mean, which suggests a right-skewed distribution. The very high maximum values (e.g., Weekly Cases = 24,066,180 and Weekly Deaths = 103,568) reinforce that these distributions are not symmetrical.
Distributions around the Median:¶
- Because of the strong skewness, most columns are not centered on their median. For Weekly Deaths, for example, the median is 13 but the mean is about 1,073, which reflects the influence of some extremely high values.
The column Total Vaccinations per Hundred shows a moderate difference between the median (79.195) and the mean (89.72), suggesting a less skewed distribution that is still not perfectly symmetrical.
Analysis of missing values¶
#Analysis of missing values
import missingno as msno
plt.figure(figsize=(20, 8))
sns.heatmap(data.isna(), cbar=False)
msno.matrix(data)
<Axes: >
# Percentage of missing values
missing_rate = (data.isna().sum()/data.shape[0])*100
print(missing_rate.sort_values())
missing_rate.sort_values().plot.bar(rot=90, figsize=(10,6), color = 'red')
Id                                      0.000000
Year                                    0.000000
Next Week's Deaths                      0.000000
Location                                0.000000
Weekly Cases                            0.164917
Weekly Cases per Million                0.672055
Weekly Deaths                           0.974016
Weekly Deaths per Million               1.478058
Daily Vaccinations per Hundred         39.731797
Daily Vaccinations                     39.731797
Daily People Vaccinated                40.079439
Daily People Vaccinated per Hundred    40.079439
Total Vaccinations per Hundred         68.820651
Total Vaccinations                     68.820651
People Vaccinated                      70.246059
People Vaccinated per Hundred          70.246059
People Fully Vaccinated per Hundred    71.682307
People Fully Vaccinated                71.682307
Total Boosters                         84.633312
Total Boosters per Hundred             84.633312
dtype: float64
<Axes: >
#The columns that contain more than 60% NaN values in the rows.
missing_groesser_60 = data.columns[missing_rate > 60]
print("More than 60% of the rows contain null values.\n\n",missing_groesser_60)
print("\n********************************************************************************")
missing_zwischen_39_40 = data.columns[(missing_rate > 38) & (missing_rate < 41)]
print("\nUp to 40% of the rows contain null values.\n\n", missing_zwischen_39_40)
print("\n********************************************************************************")
missing_sehr_klein = data.columns[missing_rate < 1.5]
print("\nContain almost no null values.\n\n", missing_sehr_klein)
More than 60% of the rows contain null values.
Index(['Total Vaccinations', 'People Vaccinated', 'People Fully Vaccinated',
'Total Boosters', 'Total Vaccinations per Hundred',
'People Vaccinated per Hundred', 'People Fully Vaccinated per Hundred',
'Total Boosters per Hundred'],
dtype='object')
********************************************************************************
Up to 40% of the rows contain null values.
Index(['Daily Vaccinations', 'Daily Vaccinations per Hundred',
'Daily People Vaccinated', 'Daily People Vaccinated per Hundred'],
dtype='object')
********************************************************************************
Contain almost no null values.
Index(['Id', 'Location', 'Weekly Cases', 'Year', 'Weekly Cases per Million',
'Weekly Deaths', 'Weekly Deaths per Million', 'Next Week's Deaths'],
dtype='object')
---------------------------------------------------------------------------------------------------------------------¶
The columns 'Total Vaccinations', 'People Vaccinated', 'People Fully Vaccinated', 'Total Boosters', 'Total Vaccinations per Hundred', 'People Vaccinated per Hundred', 'People Fully Vaccinated per Hundred', and 'Total Boosters per Hundred' have NaN (null) values in more than 60% of all rows.
The columns 'Daily Vaccinations', 'Daily Vaccinations per Hundred', 'Daily People Vaccinated', and 'Daily People Vaccinated per Hundred' contain about 40% null values.
The columns 'Id', 'Location', 'Year', and 'Next Week's Deaths' have no null values, while 'Weekly Cases', 'Weekly Cases per Million', 'Weekly Deaths', and 'Weekly Deaths per Million' have very few (below 1.5%).
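The grouping described above can also be produced in one step by binning the missing rates. The following is a minimal sketch on a made-up toy frame; the band limits (5% and 60%) and the labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Weekly Deaths": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Daily Vaccinations": [np.nan] * 4 + [1.0] * 6,
    "Total Boosters": [np.nan] * 8 + [1.0] * 2,
})

# Percentage of NaN values per column.
missing_rate = toy.isna().mean() * 100

# Assumed bands: up to 5% = "low", above 5% up to 60% = "medium",
# above 60% = "high".
bands = pd.cut(
    missing_rate,
    bins=[-0.1, 5, 60, 100],
    labels=["low", "medium", "high"],
)

summary = pd.DataFrame({"missing_%": missing_rate, "band": bands})
print(summary)
```

Such a summary table can then guide which columns to drop, which to impute, and which to leave as they are.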
Analysis of Outliers¶
# BOXPLOT OF COLUMNS
# for col in data.columns:
#     if data[col].dtype in ['int64', 'float64']:
#         Q1 = data[col].quantile(0.25)  # first quartile
#         Q3 = data[col].quantile(0.75)  # third quartile
#         median = data[col].median()
#         plt.figure(figsize=(8, 5))
#         sns.boxplot(data[col], boxprops=dict(facecolor='orange', edgecolor='black'))
#         plt.title(f'{col}', fontsize=14)
#         # Add the statistical information as text.
#         plt.xlabel(f'Q1: {Q1:.2f}, Median: {median:.2f}, Q3: {Q3:.2f}', fontsize=12)
#         plt.ylabel('Values', fontsize=12)
#         plt.show()
cols_per_row = 2
num_cols = len([col for col in data.drop(["Location", "Id"], axis=1).columns])
rows = (num_cols + cols_per_row - 1) // cols_per_row  # Calculation of the required number of rows.
fig, axes = plt.subplots(rows, cols_per_row, figsize=(12, 5 * rows))
axes = axes.flatten()
for i, col in enumerate(data.drop(["Location", "Id"], axis=1).columns):
    Q1 = data[col].quantile(0.25)  # first quartile
    Q3 = data[col].quantile(0.75)  # third quartile
    median = data[col].median()  # median
    sns.boxplot(ax=axes[i], x=data[col], boxprops=dict(facecolor='orange', edgecolor='black'))
    axes[i].set_title(f'{col}', fontsize=14)
    axes[i].set_xlabel(f'Q1: {Q1:.2f}, Median: {median:.2f}, Q3: {Q3:.2f}', fontsize=10)
    axes[i].set_ylabel('')
# Hiding unnecessary axes when the number of columns is odd.
for j in range(num_cols, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.show()
For certain variables such as Weekly Cases, Weekly Deaths, Daily Vaccinations, Daily People Vaccinated, and Next Week's Deaths, the following observations can be made:
- The values seem to be highly concentrated around a specific value or within a certain range, with a significant number of outliers above the whiskers. These outliers are represented by circles.
- The box is extremely small, indicating that the interquartile range (IQR) is very narrow. This suggests that most of the data lies within a tight range around the median.
- Since whiskers are defined as the last values within 1.5xIQR above or below the quartiles (Q1 and Q3), for variables with a very small IQR and simultaneously a very large range of values (e.g., Weekly Cases), the whiskers are very close to the edges of the box or even merged with them, making them difficult to see.
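To make the 1.5 × IQR rule concrete, the sketch below computes the fences and counts the points outside them. The helper name `iqr_outlier_summary` and the sample values are hypothetical and serve only as an illustration; the boxplot whiskers themselves end at the last data point inside these fences.

```python
import pandas as pd

def iqr_outlier_summary(values: pd.Series, k: float = 1.5) -> dict:
    """Return the quartiles, the k*IQR fences, and the outlier count."""
    s = values.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = s[(s < lower) | (s > upper)]
    return {
        "q1": float(q1),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_fence": float(lower),
        "upper_fence": float(upper),
        "n_outliers": int(len(outliers)),
        "outlier_share": len(outliers) / len(s),
    }

# Made-up sample with a narrow bulk and two extreme values,
# imitating the shape of a column such as Weekly Deaths.
sample = pd.Series([0, 1, 1, 2, 3, 5, 8, 13, 127, 103568], dtype=float)
result = iqr_outlier_summary(sample)
print(result)
```

With a narrow IQR and a huge maximum, the upper fence stays close to the box, which is why so many points appear as outliers in the boxplots.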
Analysis of variables' distributions¶
# HISTOGRAMS OF THE VARIABLES
cols_per_row = 2
num_cols = len([col for col in data.drop(["Location", "Id"], axis=1).columns])
rows = (num_cols + cols_per_row - 1) // cols_per_row  # Calculation of the number of required rows
fig, axes = plt.subplots(rows, cols_per_row, figsize=(12, 5 * rows))
axes = axes.flatten()
for i, col in enumerate(data.drop(["Location", "Id"], axis=1).columns):
    sns.histplot(ax=axes[i], x=data[col], bins=50, kde=True, color='chocolate')
    axes[i].set_title(f'{col}', fontsize=14)
# Hiding unnecessary axes when the number of columns is odd.
for j in range(num_cols, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.show()
# SKEWNESS OF VARIABLES
skewdata = data.drop("Location", axis=1)
skewness = skewdata.skew()
print(f"Skewness of the features(columns) :\n{skewness}" )
Skewness of the features(columns) :
Id                                     -0.000918
Weekly Cases                           17.334228
Year                                    0.139635
Weekly Cases per Million                7.806967
Weekly Deaths                           9.096170
Weekly Deaths per Million               8.303667
Total Vaccinations                      7.061840
People Vaccinated                       6.846799
People Fully Vaccinated                 7.143399
Total Boosters                          6.990664
Daily Vaccinations                      9.282717
Total Vaccinations per Hundred          0.518582
People Vaccinated per Hundred          -0.023551
People Fully Vaccinated per Hundred     0.150049
Total Boosters per Hundred              0.934193
Daily Vaccinations per Hundred          5.123234
Daily People Vaccinated                10.693151
Daily People Vaccinated per Hundred    13.667387
Next Week's Deaths                      9.087042
dtype: float64
plt.figure(figsize=(8,5))
sns.barplot(x = skewness.index, y = skewness.values, color='green')
plt.title("Skewness plot of Attributes")
plt.xticks(rotation=90, ha='right')
plt.show()
# PROBABILITY PLOT
import scipy.stats as stats
cols_per_row = 2
num_cols = len([col for col in data.drop(["Location", "Id"], axis=1).columns])
rows = (num_cols + cols_per_row - 1) // cols_per_row  # Calculation of the number of required rows
fig, axes = plt.subplots(rows, cols_per_row, figsize=(12, 5 * rows))
axes = axes.flatten()
for i, col in enumerate(data.drop(["Location", "Id"], axis=1).columns):
    # Drop NaN values first: probplot does not remove them itself,
    # and NaN values would break the fitted reference line.
    stats.probplot(data[col].dropna(), dist="norm", plot=axes[i])
    axes[i].set_title(f'{col}', fontsize=14)
# Hiding unnecessary axes when the number of columns is odd.
for j in range(num_cols, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.show()
Interpretation of Skewness Values:¶
- A skewness close to 0 indicates a symmetric distribution.
- A positive skewness (> 0) suggests right-skewness (long tail towards higher values).
- A negative skewness (< 0) indicates left-skewness (long tail towards lower values).
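The three cases can be reproduced on synthetic samples. The sketch below is an illustration only: the distributions, sample sizes, and seed are assumptions, and the `log1p` transform at the end is a common option for reducing right-skewness, not a step taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))
left_skewed = -right_skewed  # mirroring a sample flips the sign of its skewness

for name, sample in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    print(f"{name:>12}: skewness = {sample.skew():.2f}")

# A log transform compresses the long right tail and reduces the skewness.
log_transformed = np.log1p(right_skewed)
print(f"after log1p : skewness = {log_transformed.skew():.2f}")
```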
General Conclusions for the Variables:¶
Right-Skewed Variables (Skewness > 0):¶
Most columns exhibit a high positive skewness, suggesting highly asymmetric distributions with a long tail towards higher values.
Highly Asymmetric Columns (Skewness > 7):¶
Columns such as:
- Weekly Cases (17.33)
- Weekly Deaths (9.10)
- Daily People Vaccinated (10.69)
- Daily People Vaccinated per Hundred (13.67)
- Daily Vaccinations (9.28)
- Weekly Cases per Million, Total Vaccinations, etc.
have very high skewness values (> 7). These values highlight the presence of some extremely high observations (outliers).
Moderately Right-Skewed Columns:¶
- Total Vaccinations per Hundred (0.52)
- Total Boosters per Hundred (0.93)
These columns show moderate asymmetry, indicating that their distributions are less extreme but still not perfectly symmetric.
Left-Skewed Variables (Skewness < 0):¶
- People Vaccinated per Hundred (-0.02): the skewness is very close to 0, suggesting an almost symmetric distribution for this column.
------------------------------------------------------------------------------------------------------------------------------¶
Probability Plot¶
Axes of the Diagram:¶
- X (Theoretical Quantiles): These represent the quantiles of a standard normal distribution (or another specified distribution).
- Y (Ordered Values): These are the sorted data values from the sample.
- Red Line (Reference Line): Represents an ideal normal distribution. If the blue points align with the red line, the data closely follows a normal distribution.
Deviations from the Line:¶
- Points aligned on the red line: Indicate that the data fits a normal distribution well.
- Points deviating from the red line: Suggest that the data does not follow a normal distribution.
Findings Based on This Analysis:¶
It was observed that almost no variable in the dataset follows a normal distribution. Most distributions are asymmetric around their means, confirming deviations from normality.
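The visual impression from a probability plot can be backed by a number: with its default `fit=True`, `scipy.stats.probplot` also returns the correlation coefficient `r` of the least-squares fit between the theoretical quantiles and the ordered values. The sketch below compares two synthetic samples; the samples, sizes, and seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed for reproducibility

normal_sample = rng.normal(size=1000)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

# probplot returns ((theoretical quantiles, ordered values),
#                   (slope, intercept, r)) when fit=True (the default).
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample     : {r_normal:.4f}")
print(f"r for the right-skewed sample: {r_skewed:.4f}")
```

An `r` close to 1 corresponds to points lying on the reference line, while a clearly lower `r` corresponds to the curved patterns seen for the skewed columns.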
I-II. Content analysis:¶
Relationships between independent variables and the target variable¶
unab_variablen = data.drop(["Id", "Next Week's Deaths"], axis=1)
target = data["Next Week's Deaths"]
cols_per_row = 2
num_cols = len([col for col in unab_variablen.columns])
rows = (num_cols + cols_per_row - 1) // cols_per_row  # Calculation of the number of required rows
fig, axes = plt.subplots(rows, cols_per_row, figsize=(12, 5 * rows))
axes = axes.flatten()
for i, col in enumerate(unab_variablen.columns):
    sns.scatterplot(ax=axes[i], x=unab_variablen[col], y=target)
    axes[i].set_title(f"{col} vs. Next Week's Deaths")
    axes[i].set_xlabel(f"{col}")
    axes[i].set_ylabel("Next Week's Deaths")
# Hiding unnecessary axes when the number of columns is odd.
for j in range(num_cols, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.show()
correlation_matrix = data.drop('Location',axis=1).corr()
correlations_with_target = correlation_matrix["Next Week's Deaths"].drop("Next Week's Deaths")
plt.figure(figsize=(10,5))
sns.barplot(x= correlations_with_target.index, y=correlations_with_target.values, color = 'violet')
plt.title("Correlation plot of variables against *Next Week's Deaths*")
plt.xticks(rotation=90, ha='right')
plt.show()
The correlation plot shows how strongly each variable is linearly related to Next Week's Deaths. Some variables correlate strongly with the target, most notably Weekly Deaths, with a correlation of about 0.9. Others, such as Weekly Cases, Daily Vaccinations, and Daily People Vaccinated, correlate more weakly with Next Week's Deaths, indicating that they are less closely related to it.
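Since the variables are heavily skewed, it can be useful to compare the Pearson correlation used by `DataFrame.corr()` by default with the rank-based Spearman correlation, which is less sensitive to extreme values. The sketch below uses a synthetic toy frame; the column `Noise` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility
n = 500

# Synthetic, heavy-tailed stand-in for weekly deaths.
weekly_deaths = rng.lognormal(mean=3.0, sigma=1.5, size=n)

toy = pd.DataFrame({
    "Weekly Deaths": weekly_deaths,
    "Noise": rng.normal(size=n),  # unrelated variable
    # Next week's deaths constructed as this week's deaths +/- 20%.
    "Next Week's Deaths": weekly_deaths * rng.uniform(0.8, 1.2, size=n),
})

target = "Next Week's Deaths"
pearson = toy.corr(method="pearson")[target].drop(target)
spearman = toy.corr(method="spearman")[target].drop(target)

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison.sort_values("pearson", ascending=False))
```

If the two coefficients differ strongly for a variable, a few extreme rows may be driving the Pearson value.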
Relationships between independent variables¶
#Copy of original dataset
df = data.copy()
# Drop rows with NaN values before calling sns.pairplot(), which is also very time-consuming when there are many rows.
df.dropna(axis=0, inplace=True)
#df.drop("Id", axis=1, inplace=True)
df.head()
| | Id | Location | Weekly Cases | Year | Weekly Cases per Million | Weekly Deaths | Weekly Deaths per Million | Total Vaccinations | People Vaccinated | People Fully Vaccinated | Total Boosters | Daily Vaccinations | Total Vaccinations per Hundred | People Vaccinated per Hundred | People Fully Vaccinated per Hundred | Total Boosters per Hundred | Daily Vaccinations per Hundred | Daily People Vaccinated | Daily People Vaccinated per Hundred | Next Week's Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 241 | 275164452 | World | 4174523.0 | 2020 | 527.800 | 77527.0 | 9.802 | 11875406.0 | 7231498.0 | 44680.0 | 1.0 | 897447.0 | 0.15 | 0.09 | 0.00 | 0.0 | 113.0 | 690726.0 | 0.009 | 81042.0 |
| 242 | 857254713 | World | 4424216.0 | 2021 | 559.369 | 79456.0 | 10.046 | 13722790.0 | 9050886.0 | 58460.0 | 9.0 | 1079269.0 | 0.17 | 0.11 | 0.00 | 0.0 | 136.0 | 735617.0 | 0.009 | 92754.0 |
| 243 | 515683834 | World | 4553174.0 | 2021 | 575.674 | 80332.0 | 10.157 | 17002186.0 | 11343354.0 | 191881.0 | 15.0 | 1303377.0 | 0.21 | 0.14 | 0.00 | 0.0 | 165.0 | 851085.0 | 0.011 | 94477.0 |
| 244 | 725478352 | World | 4619286.0 | 2021 | 584.033 | 79640.0 | 10.069 | 18569106.0 | 12578084.0 | 366880.0 | 23.0 | 1397939.0 | 0.23 | 0.16 | 0.00 | 0.0 | 177.0 | 845521.0 | 0.011 | 96212.0 |
| 245 | 844503137 | World | 4649535.0 | 2021 | 587.857 | 81042.0 | 10.246 | 20361402.0 | 14002427.0 | 650359.0 | 27.0 | 1581369.0 | 0.26 | 0.18 | 0.01 | 0.0 | 200.0 | 928498.0 | 0.012 | 96742.0 |
#correlation between variables
sns.pairplot(data=df, hue='Year')
<seaborn.axisgrid.PairGrid at 0x21daa2ff290>